An investigation of the performance portability of OpenCL
نویسندگان
چکیده
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5x slower than native FORTRAN or CUDA implementations on a single node and 1.3–3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.
منابع مشابه
Improving Performance Portability in OpenCL Programs
We study the performance portability of OpenCL across diverse architectures including NVIDIA GPU, Intel Ivy Bridge CPU, and AMD Fusion APU. We present detailed performance analysis at assembly level on three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, including threads-data mapping, data layout, tiling...
متن کاملEnhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
متن کامل
Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model
Current multiand many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To keep longevity of application codes, it is highly desirable to have a programming paradigm to support these current and future architectures. Open Computing Language (OpenCL) is created to address this p...
متن کاملDealing with performance/portability and performance/accuracy trade-offs in heterogeneous computing systems: A case study with matrix multiplication modulo primes
We present the study of two important trade-offs in heterogeneous systems (i.e., between performance versus portability and between performance and accuracy) for a relevant linear algebra problem, matrix multiplication modulo primes. Integer matrix linear algebra methods rely heavily on matrix multiplication modulo primes. Double precision is necessary for exact representation of sufficiently m...
متن کاملPatterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum perfo...
متن کاملExploiting heterogeneous parallelism with the Heterogeneous Programming Library
While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper we present the Heterogeneous Programming Library (HPL),...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- J. Parallel Distrib. Comput.
دوره 73 شماره
صفحات -
تاریخ انتشار 2013